New visualization tools for numeric distributional data tables
From pre-processing to interpretation
Antonio Irpino, Ph.D.
Dept. of Mathematics and Physics
University of Campania L. Vanvitelli
Caserta, Italy
Thursday, the 9th of November, 2023
Layout
1) Aggregate and distributional data
Distributions are the numbers of the future.
Schweizer (1984)
2) Visualizing a table of (1D) distributions
3) Visualizing a single row (through eye iris or flowers)
The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey
4) Visualizing large distributional data tables (extending a heatmap)
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
John Tukey
5) An application on Chile climatic data
Aggregate and distributional data: numeric distributional data
Let’s see an example: BLOOD dataset from the HistDAWass R package.
It is a classical dataset in the Symbolic Data Analysis community, describing
- 14 typologies of patients;
- 3 distributional variables;
obtained by aggregating raw data from a hospital. See Billard and Diday (2006).
| name | V1 bins | p1 | V2 bins | p2 | V3 bins | p3 |
|---|---|---|---|---|---|---|
| u1: F-20 | [80 ; 100] | 0.025 | [12 ; 12.9] | 0.050 | [35 ; 37.5] | 0.025 |
| | [100 ; 120] | 0.075 | [12.9 ; 13.2] | 0.112 | [37.5 ; 39] | 0.075 |
| | [120 ; 135] | 0.175 | [13.2 ; 13.5] | 0.212 | [39 ; 40.5] | 0.188 |
| | [135 ; 150] | 0.250 | [13.5 ; 13.8] | 0.201 | [40.5 ; 42] | 0.387 |
| | [150 ; 165] | 0.200 | [13.8 ; 14.1] | 0.188 | [42 ; 45.5] | 0.287 |
| | [165 ; 180] | 0.162 | [14.1 ; 14.4] | 0.137 | [45.5 ; 47] | 0.038 |
| | [180 ; 200] | 0.088 | [14.4 ; 14.7] | 0.075 | | |
| | [200 ; 240] | 0.025 | [14.7 ; 15] | 0.025 | | |
| u2: F-30 | [80 ; 100] | 0.013 | [10.5 ; 11] | 0.007 | [31 ; 33] | 0.046 |
| | [100 ; 120] | 0.088 | [11 ; 11.3] | 0.039 | [33 ; 35] | 0.171 |
| | [120 ; 135] | 0.154 | [11.3 ; 11.6] | 0.082 | [35 ; 36.5] | 0.295 |
| | [135 ; 150] | 0.253 | [11.6 ; 11.9] | 0.174 | [36.5 ; 38] | 0.243 |
| | [150 ; 165] | 0.210 | [11.9 ; 12.2] | 0.216 | [38 ; 39.5] | 0.170 |
| | [165 ; 180] | 0.177 | [12.2 ; 12.5] | 0.266 | [39.5 ; 41] | 0.072 |
| | [180 ; 195] | 0.066 | [12.5 ; 12.8] | 0.157 | [41 ; 44] | 0.003 |
| | [195 ; 210] | 0.026 | [12.8 ; 14] | 0.059 | | |
| | [210 ; 240] | 0.013 | | | | |
| u14: M-80+ | [155 ; 170] | 0.067 | [10.8 ; 11.2] | 0.133 | [33.5 ; 35.5] | 0.133 |
| | [170 ; 185] | 0.133 | [11.2 ; 11.6] | 0.067 | [35.5 ; 37.5] | 0.267 |
| | [185 ; 200] | 0.200 | [11.6 ; 12] | 0.134 | [37.5 ; 39.5] | 0.267 |
| | [200 ; 215] | 0.267 | [12 ; 12.4] | 0.333 | [39.5 ; 41.5] | 0.133 |
| | [215 ; 230] | 0.200 | [12.4 ; 12.8] | 0.200 | [41.5 ; 43] | 0.200 |
| | [230 ; 245] | 0.067 | [12.8 ; 13.2] | 0.133 | | |
| | [245 ; 260] | 0.066 | | | | |
The first two and the last typologies of patients in the BLOOD dataset.
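A minimal R sketch of how the dataset can be loaded and inspected; it assumes the HistDAWass package is installed and that the accessor functions named below are available in the installed version.

```r
# Load the BLOOD distributional dataset shipped with HistDAWass
# (assumption: dataset name and accessors match the installed version of the package)
library(HistDAWass)
data(BLOOD)

get.MatH.nrows(BLOOD)     # number of observations (typologies of patients)
get.MatH.ncols(BLOOD)     # number of distributional variables
get.MatH.rownames(BLOOD)  # e.g. "F-20", "F-30", ..., "M-80+"
get.MatH.varnames(BLOOD)  # e.g. "Cholesterol", "Hemoglobin", "Hematocrit"
```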
Numerical distributional dataset
A distributional dataset is a classical table with \(N\) observations on the rows and \(P\) variables indexing the columns, such that the generic term \(y_{ij}\) is a numerical univariate distribution
\[y_{ij}\sim f_{ij}(x_j)\] where \(x_j\in D_j \subset \Re\) and \(f_{ij}(x_j)\geq 0\),
- \(\int_{D_j}f_{ij}(x_j)\,dx_{j}=1\), if the distribution has a continuous support;
- \(\sum\limits_{x_j\in D_j}{ f_{ij}(x_j)}=1\), if the distribution has a discrete support.
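As a concrete illustration, a single cell \(y_{ij}\) with bounded continuous support can be stored as a histogram, i.e. a set of bins with their masses. A minimal base-R sketch, using the first cell (V1) of u1: F-20 from the table above:

```r
# One distributional cell y_ij stored as a histogram:
# K+1 bin edges and K probabilities that must sum to 1
y_11 <- list(
  breaks = c(80, 100, 120, 135, 150, 165, 180, 200, 240),
  p      = c(0.025, 0.075, 0.175, 0.250, 0.200, 0.162, 0.088, 0.025)
)
stopifnot(length(y_11$p) == length(y_11$breaks) - 1,
          abs(sum(y_11$p) - 1) < 1e-9)  # the masses define a proper distribution
```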
The basic plot for the \(i\)-th observation
The \(i\)-th observation is the vector \(y_i=[y_{i1},\ldots,y_{ij},\dots,y_{iP}]\)
Steps :
Domain discretization
- For continuous variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, fixing an integer \(K_j\), we partition \(D_j\) into \(K_j\) equi-width intervals (bins), such that: \[D_j=\left\{ B_{jk}=(a_k,b_k] \,\middle|\, b_k>a_k,\; k=1,\ldots,K_j,\; \bigcup_{k=1}^{K_j}B_{jk}=[\min(D_j),\max(D_j)],\; B_{jk}\cap B_{jk'}=\emptyset \text{ for } k\neq k' \right\} \] (see the sketch after this list).
- For discrete variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, denoting by \(K_j=\# D_j\) the cardinality of \(D_j\), we consider its elements as categories.
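A minimal base-R sketch of the equi-width partition; the helper make_bins and the domain [80, 270] with \(K_j=50\) (the Cholesterol range used later) are illustrative choices:

```r
# Equi-width partition of a continuous domain D_j into K_j bins
make_bins <- function(dmin, dmax, K) {
  edges <- seq(dmin, dmax, length.out = K + 1)  # K+1 equally spaced cut points
  data.frame(lower = head(edges, -1), upper = tail(edges, -1))
}
bins_chol <- make_bins(80, 270, K = 50)  # 50 bins of width (270 - 80)/50 = 3.8
head(bins_chol, 3)
```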
Choice of a divergent colour palette_ We consider a divergent color palette with \(K_j\) levels, such that \(K_1\) represent the lowest category and \(K_j\) the highest one.
Stacked percentage barcharts. We compute the mass observed in each bin/category for each \(y_{ij}\).
For the \(i\)-th observation, \(P\) bars are generated. The order of the bars can be chosen according to the user's preferences, or it can be suggested by a correlation analysis performed on all the data in advance (for example, one may cluster the distributional variables using a hierarchical clustering based on the Wasserstein correlation and then use the variable order suggested by the clustering). The mass computation on the common partition is sketched below.
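One possible way to compute these masses on the common partition is sketched below, assuming the mass of each original bin is uniformly spread inside it; the function rebin_mass is illustrative and reuses y_11 and bins_chol from the previous sketches.

```r
# Re-bin the mass of one histogram cell onto the common K-bin partition,
# assuming the mass is uniformly spread inside each original bin
rebin_mass <- function(breaks, p, bins) {
  sapply(seq_len(nrow(bins)), function(k) {
    lo <- bins$lower[k]; up <- bins$upper[k]
    # overlap between the k-th target bin and each original bin
    overlap <- pmax(0, pmin(up, breaks[-1]) - pmax(lo, breaks[-length(breaks)]))
    sum(p * overlap / diff(breaks))
  })
}
m_11 <- rebin_mass(y_11$breaks, y_11$p, bins_chol)
stopifnot(abs(sum(m_11) - 1) < 1e-9)  # the total mass is preserved
```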
Polar coordinates. Polar coordinates allow us to represent the stacked barcharts as circles that mimic an eye iris.
We call this plot the Eye Iris plot (EI plot).
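The following is not the package's own plotting routine, but a rough ggplot2 sketch of the idea: stacked percentage bars wrapped by coord_polar, with a hypothetical pupil offset that empties the centre of the circle. The toy masses and the value of pupil are illustrative.

```r
library(ggplot2)

# Toy data: masses of three variables of one observation on a K = 5 partition
df <- data.frame(
  variable = rep(c("Cholesterol", "Hemoglobin", "Hematocrit"), each = 5),
  level    = factor(rep(1:5, times = 3)),   # ranked bins of the domain
  mass     = c(0.10, 0.20, 0.40, 0.20, 0.10,
               0.05, 0.25, 0.40, 0.20, 0.10,
               0.20, 0.30, 0.30, 0.10, 0.10)
)
pupil <- 0.4  # hypothetical inner offset that mimics the pupil of an eye

ggplot(df, aes(x = variable, y = mass, fill = level)) +
  geom_col(position = "fill", width = 1) +  # stacked percentage bars
  scale_fill_brewer(palette = "RdYlGn") +   # low -> high colour levels
  ylim(-pupil, 1) +                          # negative radial space becomes the pupil
  coord_polar(theta = "x") +                 # wrap the P bars around a circle
  theme_void()
```

Leaving an empty disk at the centre (the pupil) keeps the innermost colour levels from being compressed into a point, which is the distortion the pupil is meant to reduce.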
Example using BLOOD data
The extremes of the domains of the variables
Range of Cholesterol [ 80 ; 270 ]
Range of Hemoglobin [ 10.2 ; 15 ]
Range of Hematocrit [ 30 ; 47 ]
Choice of \(K\) and of a color palette
We fix \(K=50\) and use a color palette going from red (low values), through yellow (middle values), to green (high values).
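In base R, for instance, such a palette can be generated with colorRampPalette:

```r
# 50 colours from red (low) through yellow (middle) to green (high)
pal <- colorRampPalette(c("red", "yellow", "green"))(50)
```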
Now, let’s take the first observation
We recode the distributions according to the \(K=50\) partition of the domains.
Since the bins represent classes of values, we can consider them as ranked levels of the domain.
We propose to display all three distributions using a stacked percentage barchart, as follows. Note that each level of color has an area proportional to the mass associated with the corresponding bin.
[Figure: stacked percentage barcharts of the three distributions of u1: F-20]
The dashed line is positioned at level \(0.5\): for each distribution, the color level it crosses indicates the bin of the respective domain that contains the median.
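The bin containing the median (the colour level crossed by the dashed line) can be located from the cumulative mass; a base-R sketch reusing m_11 and bins_chol from the earlier re-binning example:

```r
# Index of the bin where the cumulative mass first reaches 0.5, i.e. the median bin
median_bin <- which(cumsum(m_11) >= 0.5)[1]
bins_chol[median_bin, ]  # the interval of the domain that contains the median
```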
However, this kind of visualization is not so immediate when comparing several observations. Let's see an example:
For this reason, we propose a plot based on polar coordinates, adding a pupil to reduce the distortion due to the polar transformation, as follows:
Since humans easily perceive eye shapes and colors, we believe that this kind of visualization is more interpretable. For example, let's see all 14 observations together.
[Figure: EI plots of the 14 observations of the BLOOD dataset]
Interpretation
According to the fill colours, we can compare both observations and distributional values.
The Enriched plot
We propose to add information about dispersion and skewness.
The dispersion
Each variable in the dataset may have a different dispersion. The dispersion of each distributional value is accounted for by its standard deviation \(\sigma_{ij}\). We normalize each \(\sigma_{ij}\) by the maximum standard deviation observed for the \(j\)-th variable, \(\max\limits_{i=1,\ldots,N}(\sigma_{ij})\). A segment, centered in the respective sector, allows the reader to perceive the dispersion associated with each distribution.
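A minimal sketch of this normalization, assuming the standard deviations \(\sigma_{ij}\) have already been collected in an \(N \times P\) matrix (the values below are hypothetical):

```r
# Toy N x P matrix of standard deviations sigma_ij (hypothetical values):
# rows = observations, columns = distributional variables
sigma <- matrix(c(25, 0.8, 2.1,
                  30, 0.6, 1.8,
                  20, 0.9, 2.5),
                nrow = 3, byrow = TRUE)
# Divide each column by its maximum so every normalized dispersion lies in (0, 1]
sigma_norm <- sweep(sigma, 2, apply(sigma, 2, max), "/")
```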
The skewness
Each \(y_{ij}\) has its skewness computed via the third standardized moment \(\gamma_{ij}\).
We draw the skewness of \(y_{ij}\) outside the dashed circle if it is positive, and inside it if it is negative. The distance from the dashed circle represents the absolute value of the skewness index; if the segment is very close to the dashed circle, the distribution is almost symmetric.
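One way to compute \(\gamma_{ij}\) for a histogram-valued cell, assuming the mass is uniform inside each bin; the helper below is illustrative and reuses y_11 from the earlier sketch.

```r
# Third standardized moment (skewness) of a histogram-valued cell,
# assuming the mass is uniform inside each bin
hist_skewness <- function(breaks, p) {
  a <- head(breaks, -1); b <- tail(breaks, -1)
  m1 <- sum(p * (a + b) / 2)                          # mean
  m2 <- sum(p * (a^2 + a * b + b^2) / 3)              # E[X^2]
  m3 <- sum(p * (a^3 + a^2 * b + a * b^2 + b^3) / 4)  # E[X^3]
  mu3 <- m3 - 3 * m1 * m2 + 2 * m1^3                  # third central moment
  mu3 / (m2 - m1^2)^1.5                               # gamma = mu3 / sigma^3
}
# A positive value is drawn outside the dashed circle, a negative one inside
hist_skewness(y_11$breaks, y_11$p)
```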
An example applied to Hierarchical clustering
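A hedged sketch of how such a clustering could be obtained with base R only: each distribution is summarized by a vector of quantiles, the (scaled) Euclidean distance between quantile vectors approximates the L2 Wasserstein distance, and stats::hclust builds the dendrogram. The helper quantile_of_hist and the two-element list hist_list (built from the V1 cells of u1 and u2 above) are illustrative, not functions of HistDAWass.

```r
# Quantile of a histogram (piecewise-linear inverse CDF under within-bin uniformity)
quantile_of_hist <- function(breaks, p, probs) {
  cdf <- c(0, cumsum(p))
  approx(x = cdf, y = breaks, xout = probs, ties = "ordered")$y
}

# Illustrative list of histogram cells for one variable (one element per observation)
hist_list <- list(
  u1 = y_11,
  u2 = list(breaks = c(80, 100, 120, 135, 150, 165, 180, 195, 210, 240),
            p = c(0.013, 0.088, 0.154, 0.253, 0.210, 0.177, 0.066, 0.026, 0.013))
)

probs <- seq(0.05, 0.95, by = 0.05)
Q <- t(sapply(hist_list, function(h) quantile_of_hist(h$breaks, h$p, probs)))

# Euclidean distances between quantile vectors approximate the L2 Wasserstein distance
hc <- hclust(dist(Q) / sqrt(length(probs)), method = "ward.D2")
plot(hc)
```

With the full set of 14 observations, hist_list would contain one histogram per typology of patient, and the resulting dendrogram could also be used to order the rows of the EI plots.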
Thank you !
